{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advantage Actor-Critic (A2C) on CartPole"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Actor-critic is an algorithm that combines both policy gradient (the actor) and value function (the critic)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![A2C](imgs/Advantage_actor_critic.png)\n",
    "Credit: Sergey Levine"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A2C is a more sophisticated version of the actor-critic that use the advantage, n-step return and a policy is run in multiple (synchronous) environments. \n",
    "[A3C](https://arxiv.org/pdf/1602.01783.pdf) is an asynchronous A2C with the environments that are run in parallel. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Actor and Critic can share the same neural network or have two separate network design. In this example, I used a shared network.\n",
    "<img src=\"imgs/nn_ac.png\" alt=\"drawing\" width=\"600\"/>\n",
    "Credit: Sergey Levine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import gym\n",
    "from tensorboardX import SummaryWriter\n",
    "\n",
    "import datetime\n",
    "from collections import namedtuple\n",
    "from collections import deque\n",
    "\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "import torch.optim as optim\n",
    "from torch.nn.utils.clip_grad import clip_grad_norm_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class A2C_nn(nn.Module):\n",
    "    '''\n",
    "    Advantage actor-critic neural net\n",
    "    '''\n",
    "\n",
    "    def __init__(self, input_shape, n_actions):\n",
    "        super(A2C_nn, self).__init__()\n",
    "\n",
    "        self.lp = nn.Sequential(\n",
    "            nn.Linear(input_shape[0], 64),\n",
    "            nn.ReLU())\n",
    "\n",
    "        self.policy = nn.Linear(64, n_actions)\n",
    "        self.value = nn.Linear(64, 1)\n",
    "\n",
    "    def forward(self, x):\n",
    "        l = self.lp(x.float())\n",
    "        # return the actor and the critic\n",
    "        return self.policy(l), self.value(l)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The total loss contains:\n",
    "- actor loss $\\partial\\theta_v\\leftarrow\\partial\\theta_v + \\dfrac{\\partial(R-V_\\theta(s))^2}{\\partial\\theta_v}$\n",
    "- policy loss $\\partial\\theta_\\pi\\leftarrow\\partial\\theta_\\pi + \\alpha\\triangledown_\\theta log\\pi_\\theta(a|s)(R-V_\\theta(s))$\n",
    "- entropy loss $\\beta\\sum_i\\pi_\\theta(s)log\\pi_\\theta(s)$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def calculate_loss(memories, nn, writer):\n",
    "    '''\n",
    "    Calculate the loss of the memories\n",
    "    '''\n",
    "\n",
    "    #batch_mem = np.random.choice(len(memories), size=32)\n",
    "\n",
    "    rewards = torch.tensor(np.array([m.reward for m in memories], dtype=np.float32))\n",
    "    log_val = nn(torch.tensor(np.array([m.obs for m in memories], dtype=np.float32)))\n",
    "\n",
    "    act_log_softmax = F.log_softmax(log_val[0], dim=1)[:,np.array([m.action for m in memories])]\n",
    "    # Calculate the advantage\n",
    "    adv = (rewards - log_val[1].detach())\n",
    "\n",
    "    # actor loss (policy gradient)\n",
    "    pg_loss = - torch.mean(act_log_softmax * adv)\n",
    "    # critic loss (value loss)\n",
    "    vl_loss = F.mse_loss(log_val[1].squeeze(-1), rewards)\n",
    "    # entropy loss\n",
    "    entropy_loss = ENTROPY_BETA * torch.mean(torch.sum(F.softmax(log_val[0], dim=1) * F.log_softmax(log_val[0], dim=1), dim=1))\n",
    "\n",
    "    # total loss\n",
    "    loss = pg_loss + vl_loss - entropy_loss\n",
    "\n",
    "    # add scalar to the writer\n",
    "    writer.add_scalar('loss', float(loss), n_iter)\n",
    "    writer.add_scalar('pg_loss', float(pg_loss), n_iter)\n",
    "    writer.add_scalar('vl_loss', float(vl_loss), n_iter)\n",
    "    writer.add_scalar('entropy_loss', float(entropy_loss), n_iter)\n",
    "    writer.add_scalar('actions', np.mean([m.action for m in memories]), n_iter)\n",
    "    writer.add_scalar('adv', float(torch.mean(adv)), n_iter)\n",
    "    writer.add_scalar('act_lgsoft', float(torch.mean(act_log_softmax)), n_iter)\n",
    "\n",
    "    return loss"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Env:\n",
    "    '''\n",
    "    Environment class. Used to deal with multiple environments\n",
    "    '''\n",
    "\n",
    "    game_rew = 0\n",
    "    last_game_rew = 0\n",
    "\n",
    "    def __init__(self, env_name, n_steps, gamma):\n",
    "        super(Env, self).__init__()\n",
    "\n",
    "        # create the new environment\n",
    "        self.env = gym.make(env_name)\n",
    "        self.obs = self.env.reset()\n",
    "\n",
    "        self.n_steps = n_steps\n",
    "        self.action_n = self.env.action_space.n\n",
    "        self.observation_n = self.env.observation_space.shape[0]\n",
    "        self.gamma = gamma\n",
    "\n",
    "    def step(self, agent):\n",
    "        '''\n",
    "        Execute the agent n_steps in the environment\n",
    "        '''\n",
    "        memories = []\n",
    "        for s in range(self.n_steps):\n",
    "\n",
    "            # get the agent policy\n",
    "            pol_val = agent(torch.tensor(self.obs))\n",
    "            s_act = F.softmax(pol_val[0])\n",
    "\n",
    "            # get an action following the policy distribution\n",
    "            action = int(np.random.choice(np.arange(self.action_n), p=s_act.detach().numpy(), size=1))\n",
    "\n",
    "            # Perform a step in the environment\n",
    "            new_obs, reward, done, _ = self.env.step(action)\n",
    "\n",
    "            # update the memory\n",
    "            memories.append(Memory(obs=self.obs, action=action, new_obs=new_obs, reward=reward, done=done))\n",
    "\n",
    "            self.game_rew += reward\n",
    "            self.obs = new_obs\n",
    "\n",
    "            if done:\n",
    "                # if done reset the env and the variables\n",
    "                self.done = True\n",
    "                # if the game is over, run_add take the 0 value\n",
    "                self.run_add = 0\n",
    "                self.obs = self.env.reset()\n",
    "\n",
    "                self.last_game_rew = self.game_rew\n",
    "                self.game_rew = 0\n",
    "                break\n",
    "            else:\n",
    "                self.done = False\n",
    "\n",
    "        if not self.done:\n",
    "            # if the game isn't over, run_add take the value of the last state\n",
    "            self.run_add = float(agent(torch.tensor(self.obs))[1])\n",
    "\n",
    "        # compute the discount reward of the memories and return it\n",
    "        return self.discounted_rewards(memories)\n",
    "\n",
    "\n",
    "    def discounted_rewards(self, memories):\n",
    "        '''\n",
    "        Compute the discounted reward backward\n",
    "        '''\n",
    "        upd_memories = []\n",
    "\n",
    "        for t in reversed(range(len(memories))):\n",
    "            if memories[t].done: self.run_add = 0\n",
    "            self.run_add = self.run_add * self.gamma + memories[t].reward\n",
    "\n",
    "            # Update the memories with the discounted reward\n",
    "            upd_memories.append(Memory(obs=memories[t].obs, action=memories[t].action, new_obs=memories[t].new_obs, reward=self.run_add, done=memories[t].done))\n",
    "\n",
    "        return upd_memories[::-1]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Memory = namedtuple('Memory', ['obs', 'action', 'new_obs', 'reward', 'done'], verbose=False, rename=False)\n",
    "\n",
    "# Hyperparameters\n",
    "GAMMA = 0.95\n",
    "LEARNING_RATE = 0.003\n",
    "ENTROPY_BETA = 0.01\n",
    "ENV_NAME = 'CartPole-v0'\n",
    "\n",
    "MAX_ITER = 100000\n",
    "# Number of the env\n",
    "N_ENVS = 40\n",
    "\n",
    "# Max normalized gradient\n",
    "CLIP_GRAD = 0.1\n",
    "\n",
    "device = 'cpu'\n",
    "\n",
    "now = datetime.datetime.now()\n",
    "date_time = \"{}_{}.{}.{}\".format(now.day, now.hour, now.minute, now.second)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create N_ENVS environments\n",
    "envs = [Env(ENV_NAME, 1, GAMMA) for _ in range(N_ENVS)]\n",
    "\n",
    "writer = SummaryWriter(log_dir='content/runs/A2C'+ENV_NAME+'_'+date_time)\n",
    "\n",
    "# initialize the actor-critic NN\n",
    "agent_nn = A2C_nn(gym.make(ENV_NAME).observation_space.shape, gym.make(ENV_NAME).action_space.n).to(device)\n",
    "\n",
    "# Adam optimizer\n",
    "optimizer = optim.Adam(agent_nn.parameters(), lr=LEARNING_RATE, eps=1e-3)\n",
    "\n",
    "experience = []\n",
    "n_iter = 0\n",
    "\n",
    "while n_iter < MAX_ITER:\n",
    "    n_iter += 1\n",
    "\n",
    "    # list containing all the memories\n",
    "    memories = [mem for env in envs for mem in env.step(agent_nn)]\n",
    "\n",
    "    # calculate the loss\n",
    "    losses = calculate_loss(memories, agent_nn, writer)\n",
    "\n",
    "    # optimizer step\n",
    "    optimizer.zero_grad()\n",
    "    losses.backward()\n",
    "    # clip the gradient\n",
    "    clip_grad_norm_(agent_nn.parameters(), CLIP_GRAD)\n",
    "    optimizer.step()\n",
    "\n",
    "\n",
    "    writer.add_scalar('rew', np.mean([env.last_game_rew for env in envs]), n_iter)\n",
    "    print(n_iter, np.round(float(losses),2), 'rew:', np.round(np.mean([env.last_game_rew for env in envs]),2))\n",
    "\n",
    "writer.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ATTENTION! the model is not working, look at the graph below. Why this strange behavior? I tried to tune the hyperparameters but the results are the same.\n",
    "![Reward plot](imgs/reward_plot_a2c.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Why is the loss decreasing so fast? \n",
    "![Reward plot](imgs/loss_plot_a2c.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### In some cases, the model start preferring always the same action..\n",
    "![Reward plot](imgs/actions_plot_a2c.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some idea:\n",
    " - Use two different neural networks and optimizer"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
